Abstract

It is widely believed that the Internet's AS-graph degree distribution obeys a power-law form. Most of 
the evidence showing the power-law distribution is based on BGP data. However, it was recently argued 
that since BGP collects data in a tree-like fashion, it only produces a sample of the degree distribution, 
and this sample may be biased. This argument was backed by simulation data and mathematical analysis, 
which demonstrated that under certain conditions a tree sampling procedure can produce an artificial 
power-law in the degree distribution. Thus, although the observed degree distribution of the AS-graph 
follows a power-law, this phenomenon may be an artifact of the sampling process. 

In this work we provide some evidence to the contrary. We show, by analysis and simulation, that 
when the underlying graph degree distribution obeys a power-law with an exponent $\gamma > 2$, a tree-like
sampling process produces a negligible bias in the sampled degree distribution. Furthermore, recent data 
collected from the DIMES project, which is not based on BGP sampling, indicates that the underlying 
AS-graph indeed obeys a power-law degree distribution with an exponent $\gamma > 2$. By combining this
empirical data with our analysis, we conclude that the bias in the degree distribution calculated from 
BGP data is negligible. 

Classification: Network Computing, Internet Topology Models 



1 Introduction 

1.1 Background and Motivation 

The connectivity of the Internet crucially depends on the relationships between thousands of Autonomous 
Systems (ASes) that exchange routing information using the Border Gateway Protocol (BGP). These
relationships can be modeled as a graph, called the AS-graph, in which the vertices model the ASes, and the
edges model the peering arrangements between the ASes. 

Significant progress has been made in the study of the AS-graph's topology over the last few years. 
A great deal of effort has been spent measuring topological features of the Internet. Numerous research 
projects [FFF99] [AB00] [CCG+02] [LBCX03] [WGJ+02] [WJ02] [BT02] [BGW05] [BGW04] [KRR01] [SSM]
[LC03] [BB01] [BBBC01] [RN04] [TPSF01] [TGJ+02] [GT00] [BS02] have ventured to capture the Internet's
topology. Based on these and other topological studies, it is widely believed that the Internet's degree
distribution has a power-law form with an exponent $2 < \gamma < 3$, i.e., the fraction of vertices with degree $k$ is
proportional to $k^{-\gamma}$. Most of these studies are based on BGP data from sources such as Route Views [Rou05]
and CIDR Reports [BSH03].

1.2 Related Work 

It was recently argued [LBCX03] [CM04a] [CM04b] [PR04] [ACKM05] [SSM] [CCG+02] [TGJ+02] [WGJ+02]
that the evidence obtained from the analysis of the Internet graphs sampled as described above may be
biased. In a thought-provoking article, Lakhina et al. [LBCX03] claimed that a power-law degree distribution
may be an artifact of the BGP data collection procedure. They suggest that although the observed degree
distribution of the AS-graph follows a power-law distribution, the degree distribution of the real AS-graph
might be completely different. They claim that with tree-like sampling, such as that employed by BGP, an
edge is much more likely to be visible, i.e., included in the sampled graph, if it is close to the root.
Moreover, in tree-like sampling, high-degree vertices are more likely to be encountered early on, and therefore
they are sampled more accurately than low-degree vertices. They backed this argument with simulations
that indicated that under some conditions, a BFS sampling process in itself is sufficient to produce a
power-law degree distribution in the sample, even when the underlying graph is a sufficiently dense Erdős-Rényi
[ER60] graph.

Subsequently, [CM04a] [CM04b] gave a mathematical foundation to the argument of [LBCX03]. They
showed that BFS tree sampling produces a power-law degree distribution, with an exponent of $\gamma = 1$, both
for Poisson-distributed random graphs and for $\delta$-regular random graphs. In other words, they showed that a
tree sampling process may have a significant bias, and may produce an artificial power-law, albeit with an
exponent that is very different from that observed in the AS-graph.

On the other hand, Petermann and De Los Rios [PR04] showed that for single-source tree sampling of
a BA graph [BA99], the exponent obtained for the power-law distribution is only slightly under-estimated.
This cannot be viewed as strong evidence against the argument of [LBCX03], since the analysis assumes
a BA-model, which is a highly idealized evolution model of the AS-graph. However, this result does
indicate that at least in some power-law graphs, a tree sample does not create a significant bias in the degree
distribution.

More recently, [ACKM05] analyzed the degree distribution discovered by a BFS tree sampling process
over a general graph. Among other results, they gave a general, but rather unwieldy, expression of the degree
distribution of the sampled graph, depending on the underlying graph degree distribution. This work is the
starting point of our analysis: we use the results of [ACKM05] to analyze the sampled degree distribution
when the underlying graph has a power-law degree distribution. 

A new development in the empirical measurement of the Internet topology was suggested recently by 
Shavitt and Shir [SS05]. In this work they describe an Internet mapping system called DIMES. DIMES
is a distributed measurement infrastructure for the Internet that is based on the deployment of thousands
of lightweight measurement agents around the globe. Unlike BGP data, which is sampled in a tree-like
fashion, DIMES executes traceroutes among all pairs of its agents, collects the router-level results,
and aggregates the AS-graph from this data. Because of this measurement methodology, DIMES discovers
significantly more links than BGP-based systems. The salient point for our purposes is that DIMES data
too shows a power-law degree distribution, with an exponent $2 < \gamma < 3$, and since the DIMES system
uses an all-pairs measurement paradigm, it is difficult to claim that the power-law is an artifact of a tree-like
sampling. 

1.3 Contributions



Our main contribution is our analysis of the degree distribution observed in the BFS tree sample, when 
the underlying graph has a power-law distribution with an exponent $2 < \gamma < 3$. Under these conditions,
we prove that the bias in the power-law is negligible: with high probability, the degree distribution of the
high-degree nodes in the sample also exhibits a power-law, with exactly the same exponent $\gamma$. We validate
our mathematical analysis with simulation results using the DIMES-measured AS-graph as the underlying
power-law graph. 

Putting this result in the context of the Internet topology, we recall that the data collected from the DIMES
project is not based on BGP-style tree sampling. Nevertheless, DIMES data indicates that the underlying
AS-graph indeed obeys a power-law degree distribution with an exponent $\gamma > 2$. By combining this
empirical data with our analysis, we conclude that the bias in the degree distribution calculated from BGP data is
negligible.

Organization: In the next section we give an overview of the results of [ACKM05] that we rely on. In
Section 3 we show our main result, that the bias in the degree distribution of a tree-sampled power-law
graph is negligible. In Section 4 we sketch an alternative, more rigorous, analysis of a weaker result, that
validates some of the approximations we used in our main result. Section 5 describes the results of our
simulations. We conclude with Section 6.



2 Highlights of [ACKM05]

2.1 The General Framework

The proof of our result is based on the model, sampling process, and the main results described in [ACKM05].
In this section we give a brief introduction to the main results we need.

Notation: Throughout the paper we use $G = (V, E)$ to denote the underlying graph, and $n = |V|$ to denote
the number of vertices.



Definition 1 We say that $\{a_j\}$ is a degree distribution of $G$ if $G$ contains $a_j \cdot n$ nodes of degree $j$.

In the [ACKM05] model, the graph $G$ is not a given graph but a random graph chosen out of a family of
graphs obeying a given degree distribution $\{a_j\}$. The basic setting is the configuration model of [Bol85]: for
each vertex of degree $k$ we create $k$ copies, and then define the edges of the graph according to a uniformly
random matching on these copies.
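
The copy-and-match construction can be sketched in a few lines of Python. This is only an illustrative sketch, not code from [ACKM05]: the function names `configuration_model` and `degree_counts` and the example degree sequence are assumptions made here, and self-loops and parallel edges are kept, so the result is a multigraph.

```python
import random
from collections import Counter

def configuration_model(degrees, seed=None):
    """Return a list of edges (u, v) from a uniformly random matching of vertex copies.

    degrees[v] is the target degree of vertex v; the degree sum must be even.
    Self-loops and parallel edges are kept, so the result is a multigraph.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    # One copy of v per unit of degree.
    copies = [v for v, d in enumerate(degrees) for _ in range(d)]
    # Shuffling and pairing consecutive entries yields a uniformly random matching.
    rng.shuffle(copies)
    return [(copies[i], copies[i + 1]) for i in range(0, len(copies), 2)]

def degree_counts(edges, n):
    """Count how many edge endpoints land on each of the n vertices."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1  # a self-loop (u == v) contributes 2 to the degree of u
    return [counts[v] for v in range(n)]

degrees = [3, 3, 2, 2, 1, 1]  # hypothetical degree sequence, chosen for illustration
edges = configuration_model(degrees, seed=1)
print(edges)
print(degree_counts(edges, len(degrees)))
```

Whatever matching is drawn, every vertex ends up with exactly its prescribed degree, which is the defining property of the model.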



2.2 The BFS Tree Sampling Process

The [ACKM05] model defines a randomized process that simultaneously produces a random graph $G$
obeying the degree distribution $\{a_j\}$, and a BFS tree $T$ that represents the sample. Note that for a given
graph $G$, a BFS algorithm is a deterministic algorithm, but different outcomes are possible, depending on
the order in which outgoing edges are traversed. In the [ACKM05] model, a random choice determines this
order.

The sampling process is thought of as taking place in continuous time. However, for technical reasons the
authors define a non-standard notion of time, which we denote by a capitalized word (Time). In this model,
the BFS sample process starts at Time $t = 1$ with an empty tree $T$. As the sampling process evolves, Time
decreases to $t = 0$, when the sample tree $T$ includes all $n$ nodes (assuming $G$ is connected).

Before the process starts, for each vertex $v$ there are $\deg(v)$ copies of $v$. Each copy is given a real-valued
index, so vertex $v$ has $\deg(v)$ indices, chosen uniformly and independently at random from the unit
interval $[0, 1]$. At every Time step $t$ two copies are
matched: one copy is a copy of a vertex already discovered, and the other copy is a copy with index $t$. Such
a matched pair forms an edge of the original graph. According to [ACKM05], at Time $t$ the indices of the
unmatched copies are uniformly random in $[0, t)$. Let the maximum index of a vertex be the maximum of
all its copies' indices. Then at any Time $t$, the vertices that have not been discovered yet are precisely those
whose maximum index is less than $t$. An edge will be visible, namely, included in the BFS tree, if at the
Time its endpoints are matched one of them is a copy of an undiscovered vertex.
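
To make the notion of a visible edge concrete, the sketch below runs an ordinary BFS with a randomized neighbour order on a fixed graph and records each vertex's degree in the resulting tree. It is an assumption-laden simplification: it does not implement the continuous-Time copy-matching process of [ACKM05], and the function name `bfs_tree_degrees` and the toy adjacency list are hypothetical.

```python
import random
from collections import deque

def bfs_tree_degrees(adj, root, seed=None):
    """Return {vertex: degree in the BFS tree} for the component containing root.

    adj maps each vertex to a list of neighbours. The neighbours of each vertex
    are scanned in random order, so different seeds may give different trees.
    """
    rng = random.Random(seed)
    tree_deg = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        neighbours = list(adj[u])
        rng.shuffle(neighbours)
        for v in neighbours:
            if v not in tree_deg:   # first visit: (u, v) becomes a visible tree edge
                tree_deg[v] = 1     # the edge through which v was discovered
                tree_deg[u] += 1
                queue.append(v)
    return tree_deg

# A 4-cycle 0-1-2-3 with the chord 0-2: 5 graph edges, of which a BFS tree keeps 3.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
tree_deg = bfs_tree_degrees(adj, root=0, seed=0)
print(tree_deg)
```

In this toy graph vertex 2 has graph degree 3 but tree degree 1, since the edges 1-2 and 2-3 connect vertices that were already discovered and are therefore invisible to the sample.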

2.3 Useful Notations and Theorems of [ACKM05]

Let $v_t$ be the vertex that has a copy with maximum index $t$. Denote by $P_{vis}(t)$ the probability that another
edge outgoing from $v_t$ appears in the BFS tree. Namely, $v_t$ is the vertex that was discovered at Time $t$, and
we are interested in the probability that another vertex is discovered through $v_t$.

Using the expectation of $P_{vis}(t)$ as an approximation, we get from Equation (7) and Lemma 3 of [ACKM05]
the following Theorem:¹

Theorem 1 Let $G$ be a connected graph and let $\{a_j\}$ be a degree distribution that is upper bounded by a
power-law with an exponent larger than 2. Let $\mu = \sum_j j a_j$ be the mean degree of $G$. Then it holds that



3 The Degree Distribution of the Sampled Graph 

Our goal in this section is to show that the bias observed in BFS tree sampling regarding the degree
distribution of the sampled graph is not significant, when the underlying graph degree distribution obeys a
power-law with an exponent $\gamma > 2$. We show the above by examining the BFS tree received by the
sampling process of [ACKM05], described in the previous section. Finding a BFS tree from a single source
is an idealization of the BGP data collection process. Thus, if the bias is negligible when using such a BFS
process, then we argue that it is very likely to be negligible when using the more general case of BGP.

Recall that we focus on a BFS process on a random graph $G$. Let $T$ be the BFS tree received by this
process. Let $\deg_T(v)$ denote the degree of node $v$ in the BFS tree received from the sampled graph. Let
$\deg_G(v)$ denote the degree of node $v$ in the graph.

Definition 2 We say that a vertex $v$ has a high degree if $\deg_G(v) > 18$. We say that an event occurs with
high probability (w.h.p.) if it occurs with probability at least $1 - 0.16 = 0.84$.

The main theorem we prove is the following: 

¹ In Lemma 3 of [ACKM05] there is a requirement that $a_j = 0$ for $j < 3$. This implies that with high probability the graph
is connected. However, if we assume that the graph is connected, which is the case we are interested in, this requirement can be
relaxed.




(1) 

Theorem 2 If the underlying graph degree distribution obeys a power-law with an exponent $\gamma$, where
$2 < \gamma < 3$, then w.h.p. the degree distribution of the high-degree vertices of the sampled graph also follows
a power-law, with the same exponent value.

We prove Theorem 2 using several Lemmas and Theorems. Throughout the remainder of this paper we
use the following setting: 

Definition 3 Let $a_k = C \cdot k^{-\gamma}$ be a degree distribution, where $2 < \gamma < 3$ and $C > 0$ is a constant
normalization factor. Let $\mu$ denote the mean graph degree, i.e., $\mu = E[\deg_G(v)] = \sum_j j a_j$.

Note that since the degree distribution is a power-law with an exponent larger than 2, in this case $\mu$ is finite.
Our starting point is Theorem 1 [ACKM05]. Our first step is to approximate the following sum, which
appears in Equation (1), when the degree distribution obeys the power-law of Definition 3:

$$\sum_k k a_k t^k = C \sum_k k \cdot k^{-\gamma} t^k = C \sum_k k^{1-\gamma} t^k. \qquad (2)$$

Lemma 1 Let $\mu$, $a_k$, and $\gamma$ be as in Definition 3. Then

$$\sum_k k^{1-\gamma} t^k \approx \frac{t^3}{\gamma - 2}$$

for all $0 < t < 1$.
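Proof: Let $g(t) = \int_1^\infty x^{1-\gamma} t^x \, dx$. Then for all $i \ge 1$ we have that the $i$th derivative of $g(t)$ is

$$g^{(i)}(t) = \int_1^\infty x^{1-\gamma} \prod_{j=0}^{i-1} (x - j) \, t^{x-i} \, dx.$$

Let us now evaluate $g$, and its derivatives, at the boundaries of $[0, 1]$:

• $g(0) = 0$,

• $g^{(i)}(0) = 0$ for all $i \ge 1$,

• $g(1) = \int_1^\infty x^{1-\gamma} \, dx = \left[ \frac{x^{2-\gamma}}{2-\gamma} \right]_1^\infty = \frac{1}{\gamma - 2}$,

• $g^{(i)}(1) = \infty$ for all $i \ge 2$.

We will approximate $g(t)$ by an interpolation polynomial $f(t) = \sum_i b_i t^i$. Obviously, no polynomial has
$f^{(i)}(1) = \infty$ for any $i$. Thus we will use $g$'s values at $t = 0, 1$, and the derivatives at 0. The minimal-degree
non-trivial polynomial we can use is a cubic $f(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3$. Since $g(0) = 0$ we get that
$b_0 = 0$. $g'(0) = 0$ implies $b_1 = 0$, and $g''(0) = 0$ implies $b_2 = 0$. Since $g(1) = \frac{1}{\gamma - 2}$ we get that $b_3 = \frac{1}{\gamma - 2}$.
Thus $f(t) = t^3 / (\gamma - 2)$ and

$$g(t) = \int_1^\infty x^{1-\gamma} t^x \, dx \approx \frac{t^3}{\gamma - 2}. \qquad (3)$$

■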



Notes:

• Lemma 1 approximates the sum of Equation (2) using a cubic polynomial. This is a somewhat
arbitrary choice: one can use any polynomial of degree $d \ge 3$ and obtain similar results. Using
higher-degree polynomials yields a better-quality approximation around $t = 0$ since additional derivatives
are approximated, but does not necessarily improve the accuracy of the approximation at $t = 1$ since
we can only use $g(1)$ itself. Since we mostly care about the early stages in the evolution, near Time
$t = 1$, we only present the result for the special case of $d = 3$.

• At present we give no bound on the error of our polynomial approximation. Instead, to validate our
results, in Section 4 we present a weaker result proven rigorously without using the approximation.

• No polynomial can reproduce $g^{(i)}(1) = \infty$ for any $i$, so our approximation accuracy is fundamentally
limited around $t = 1$. It would be interesting, and technically more difficult, to approximate the sum
using rational functions or other functions with an asymptote at $t = 1$. We leave this to future work.
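
Since no error bound is given for the cubic approximation, a reader may want to inspect the gap numerically. The following Python sketch evaluates a truncated version of the sum in Lemma 1 next to the cubic $t^3/(\gamma - 2)$. The function names, the truncation length `terms`, and the choice $\gamma = 2.5$ are assumptions made for illustration only; the sketch makes no claim about how close the two quantities are.

```python
def power_sum(t, gamma, terms=10_000):
    """Truncated value of sum_{k>=1} k**(1 - gamma) * t**k, for 0 < t < 1."""
    total = 0.0
    power = 1.0
    for k in range(1, terms + 1):
        power *= t                          # power == t**k
        total += k ** (1.0 - gamma) * power
    return total

def cubic_approx(t, gamma):
    """The cubic f(t) = t**3 / (gamma - 2) from the proof of Lemma 1."""
    return t ** 3 / (gamma - 2.0)

gamma = 2.5  # arbitrary illustrative exponent in (2, 3)
for t in (0.25, 0.5, 0.75, 0.9):
    print(f"t={t}: sum={power_sum(t, gamma):.4f}  cubic={cubic_approx(t, gamma):.4f}")
```

Printing both columns side by side for several values of $t$ and $\gamma$ shows where the interpolation is tight and where it is loose.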

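Lemma 2 Let $\mu$, $a_k$, and $\gamma$ be as in Definition 3. Then $\mu(\gamma - 2) \approx C$.

Proof:

$$\begin{aligned}
\mu(\gamma - 2) &= (\gamma - 2) \sum_k k a_k = (\gamma - 2) \, C \sum_k k \cdot k^{-\gamma} \\
&= (\gamma - 2) \, C \sum_k k^{1-\gamma} \approx (\gamma - 2) \, C \int_1^\infty x^{1-\gamma} \, dx \\
&= (\gamma - 2) \, C \left[ \frac{x^{2-\gamma}}{2-\gamma} \right]_1^\infty = (\gamma - 2) \, C \cdot \frac{1}{\gamma - 2} = C.
\end{aligned} \qquad (4)$$

(The last equality is valid for $\gamma > 2$.) ■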

In addition to being a building block in proving our main Theorem, the following Lemma 3 is important
since it shows that most of the edges are detected early in the sampling process, near Time $t = 1$.

Recall that $P_{vis}(t)$ is the probability that the vertex discovered at Time $t$ gives rise to another edge in
the BFS tree, i.e., not the edge it was detected with.

Lemma 3 $P_{vis}(t) \approx t^3$.

Proof: Recall that $a_k = C \cdot k^{-\gamma}$. Using Equation (1) and Lemma 1 we further approximate $P_{vis}$.

k 

1 / > . nn.tJ \ 

Pvis (^) 

2 




and by Lemma [2] we have that 

7-2 



k 



^k'~^fK (5) 



Let w = t'^. By substituting w in equation ^ we get 



Pvisit) - ^E^'"^"^' 



t3 
Using Lemma 1 again we get that



^ — 2 fW\^ 3 



Note that at Time $t = 0$ Lemma 3 gives $P_{vis} \approx 0$, which is as expected: at the end of the BFS process
no new tree-edges are detected. Furthermore, at Time $t = 1$ we get $P_{vis} \approx 1$, again matching our intuition
that at the beginning of the BFS process the edges detected are very likely to be tree edges. Moreover, observe
that most edges are detected at the beginning of the BFS process.

Recall that in the BFS sampling process of [ACKM05] each copy of a vertex $v$ is assigned a Time index
$t \in [0, 1]$ (Section 2.2).



Definition 4 Let max-index($v$) be the maximum index of a vertex $v$, where the maximum is taken over all
copies of $v$.

Lemma 4 Let $v$ be a vertex with graph degree $i$ and let max-index($v$) $= t$. Then
$E[\deg_T(v) \mid \deg_G(v) = i, \text{max-index}(v) = t] \approx (i - 1) t^3$.

Proof: We follow the discussion in [ACKM05], and neglect the possibility of self-loops and parallel edges
involving a vertex $v$ and its siblings, and ignore the fact that we are choosing without replacement (i.e.,
that processing each copy slightly changes the number of undetected vertices and the number of unmatched
copies). Under these assumptions, the events that each of $v$'s siblings give rise to edges that will be
detected in the tree are independent, and [ACKM05] shows that the number of visible edges is approximately
binomially distributed as $\mathrm{Bin}(i - 1, P_{vis}(t))$. Therefore

$$E[\deg_T(v) \mid \deg_G(v) = i, \text{max-index}(v) = t] = (i - 1) P_{vis}(t). \qquad (8)$$

Thus, using Lemma 3 it holds that

$$E[\deg_T(v) \mid \deg_G(v) = i, \text{max-index}(v) = t] \approx (i - 1) t^3. \qquad (9)$$



Theorem 3 Let $v$ be a vertex with graph degree $i$. Then

$$E[\deg_T(v) \mid \deg_G(v) = i] \approx \frac{i(i-1)}{i+3}. \qquad (10)$$

Proof: Since $t = \text{max-index}(v)$ is the maximum of $i$ independent uniform variables in $[0, 1]$, its probability
density is $dt^i/dt = i t^{i-1}$. Therefore, using Lemma 4 we get

$$\begin{aligned}
E[\deg_T(v) \mid \deg_G(v) = i] &= \sum_k k \Pr[\deg_T(v) = k \mid \deg_G(v) = i] \\
&= \sum_k k \int_0^1 i t^{i-1} \Pr[\deg_T(v) = k \mid \deg_G(v) = i, \text{max-index}(v) = t] \, dt \\
&= \int_0^1 i t^{i-1} \sum_k k \Pr[\deg_T(v) = k \mid \deg_G(v) = i, \text{max-index}(v) = t] \, dt \\
&= \int_0^1 i t^{i-1} E[\deg_T(v) \mid \deg_G(v) = i, \text{max-index}(v) = t] \, dt \\
&\approx \int_0^1 i t^{i-1} (i-1) t^3 \, dt = i(i-1) \int_0^1 t^{i-1+3} \, dt = \frac{i(i-1)}{i+3}.
\end{aligned} \qquad (11)$$

This completes the proof of the Theorem.
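
The closed form of Theorem 3 can be cross-checked by simulation under the same modelling assumptions: the maximum index $t$ is the maximum of $i$ uniform draws, and the remaining $i - 1$ copies each yield a visible edge independently with probability $P_{vis}(t) \approx t^3$ (Lemma 3). The Python sketch below is illustrative only; the function names, the number of trials, and the seed are assumptions introduced here.

```python
import random

def expected_tree_degree(i):
    """Closed form of Theorem 3: i * (i - 1) / (i + 3)."""
    return i * (i - 1) / (i + 3)

def simulated_tree_degree(i, trials=20_000, seed=0):
    """Monte Carlo mean of Bin(i - 1, t**3), where t is the maximum of i uniforms."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        t = max(rng.random() for _ in range(i))   # max-index of the vertex
        p_vis = t ** 3                            # approximation of Lemma 3
        total += sum(rng.random() < p_vis for _ in range(i - 1))
    return total / trials

i = 20  # hypothetical graph degree, chosen for illustration
print(expected_tree_degree(i), simulated_tree_degree(i))
```

The simulation only tests the integration step of the proof; it takes the approximation $P_{vis}(t) \approx t^3$ as given and says nothing about its accuracy.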



Note: For high-degree nodes $\frac{i}{i+3} > \frac{6}{7}$, so for high-degree nodes we have that

$$E[\deg_T(v) \mid \deg_G(v) = i] > \frac{6}{7}(i - 1). \qquad (12)$$



Theorem 4 Let $m = \frac{i(i-1)}{2(i+3)}$. Then for every node $v$ it holds that

$$\Pr[\deg_T(v) \ge m \mid \deg_G(v) = i] \ge 1 - \epsilon(i),$$

where $\epsilon(i) = e^{-\frac{i(i-1)}{8(i+3)}}$.

Proof: Recall that the events that each of $v$'s copies give rise to a visible edge are approximately independent.
Therefore we can use the Chernoff bound (cf. [MR90]). Let $E = E[\deg_T(v) \mid \deg_G(v) = i]$. Then using
Theorem 3 we get

$$\Pr[\deg_T(v) < m \mid \deg_G(v) = i] \le \Pr[\deg_T(v) < E/2 \mid \deg_G(v) = i] \le e^{-E/8} \approx e^{-\frac{i(i-1)}{8(i+3)}}.$$



Note: For high-degree nodes ($i > 18$) we have that $\epsilon(i) < 0.16$, and for $i > 32$ we have that $\epsilon(i) < 0.03$.
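
The two thresholds in this note follow from evaluating $\epsilon(i) = e^{-i(i-1)/(8(i+3))}$ directly. A minimal Python check is given below; the function name `chernoff_epsilon` is an assumption made here, not notation from the paper.

```python
import math

def chernoff_epsilon(i):
    """epsilon(i) = exp(-i * (i - 1) / (8 * (i + 3))) from Theorem 4."""
    return math.exp(-i * (i - 1) / (8 * (i + 3)))

for i in (18, 32, 64):
    print(i, round(chernoff_epsilon(i), 4))
```

The value is roughly 0.16 at $i = 18$ and roughly 0.03 at $i = 32$, and it decreases as $i$ grows, so the failure probability of the bound shrinks for higher-degree nodes.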

Our main Theorem is now a corollary of Theorem 4.

Proof of Theorem 2: As a result of Theorem 4 we get that w.h.p. for high-degree nodes

$$1 + \frac{i(i-1)}{2(i+3)} \le \deg_T(v) \le i,$$

where $i = \deg_G(v)$. Since for high-degree nodes $\frac{i(i-1)}{2(i+3)} > \frac{3}{7}(i - 1)$, we have that w.h.p. for high-degree
nodes

$$1 + \frac{3}{7}(i - 1) < \deg_T(v) \le i.$$

Therefore w.h.p. for high-degree nodes $\deg_T(v) \approx 1 + c(i - 1)$, where $c$ is a constant s.t. $\frac{3}{7} < c < 1$. Thus
w.h.p. for high-degree nodes

$$\Pr[\deg_T(v) = k] \approx \Pr\left[\deg_G(v) = \frac{k-1}{c} + 1\right] = C \left( \frac{k-1}{c} + 1 \right)^{-\gamma} \propto k^{-\gamma}. \qquad (13)$$



4 A More Rigorous Analysis 



Our analysis of the bias, and especially Lemma 1, used a somewhat cavalier polynomial approximation.
In this section we give an alternative derivation of the conservation of the power-law tail behavior without
relying on the polynomial approximation of the sum. We use a more rigorous approach, but we show a
weaker result that validates the approximations up to multiplicative constants for large $k$.

Lemma 5 If $\gamma > 2$ then $Ct < C \sum_k k^{1-\gamma} t^k < \mu$ for all $0 < t < 1$.

Proof: All summands are positive, so the sum is larger than the first summand. Also the sum is increasing
with $t$ and equals $\mu$ for $t = 1$. ■

Lemma 6 PyisU) > ^for all < t < 1 

Proof: Recall that $a_k = C \cdot k^{-\gamma}$. Using Equation (1) and Lemma 5 we further approximate $P_{vis}$ as follows.

1 .kfEjjajt^Y 



k 

> ^. (14) 

Theorem 5 For large enough $k$ there exists $c_1 > 0$ such that

$$c_1 k^{1-\gamma} < \Pr[\deg_T(v) > k] < C k^{1-\gamma}$$

for a random $v$.

Proof: The upper value follows immediately from the fact that the visible degree is at most the graph degree.
By Lemma[6]we have that Pvis{t) > / li^ for all < i < 1. Therefore, for a random v it follows that 

E[degT{v)] > -^degdv) 

and hence, 

E[degG{v) - degriv)] - ~ degdv). 
By the Markov inequality this means that 

(1 - degciv) 

Pr[degG{v) — degxiv) > a] < 



a 



Take a = ^1 — ^ + e^ degdv) for some constant e. The probability of a node v to have such a difference 
between its tree-degree and its graph-degree is at most some constant less than 1, and therefore, a constant
fraction of the nodes have a degree proportional to the original degree. Therefore, the tail of the distribution
has a power law with exponent at least $\gamma$. ■

Notice that for all $\gamma' < \gamma$, there exists some large $K_{\gamma'}$ such that $C' K_{\gamma'}^{-\gamma'} > C K_{\gamma'}^{-\gamma}$. Therefore, the exponent
of the power law cannot decrease throughout the entire degree sequence.

In fact, since the high-degree nodes are discovered almost surely at $t \approx 1$, we expect to see $P_{vis} \approx 1$ for
these nodes, and therefore, the behavior of the tail is almost unchanged. Giving an exact bound near $t = 1$
is deferred to future work.

Figure 1: CCDF graphs for BFS trees and raw DIMES data (horizontal axis: Degree, logarithmic scale).

Table 1: Sampled power law exponent $\gamma$ for each group.

Data Source       $\gamma$
Group 1           2.101
Group 2           2.079
Group 3           2.072
Raw DIMES Data    2.126

5 Simulation Results 

To further validate our analysis, we conducted a simulation study. We used the data collected by Shavitt and 
Shir [SS05] in the DIMES project as our underlying graph.

To test whether the choice of the BFS tree root has a noticeable effect on the resulting degree distribution,
the graph vertices were split into the following 3 groups, based on their graph degree:

1. Low-degree nodes: $1 \le \deg_G(v) \le 35$,

2. Medium-degree nodes: $36 \le \deg_G(v) \le 70$,

3. High-degree nodes: $\deg_G(v) \ge 71$.

From each group we selected 10 nodes at random and constructed a BFS tree for each, where the selected
node was the tree root. An average CCDF of the degree distribution was then calculated for every group. We
compared the resulting curves to the original connectivity data, collected by DIMES.

Figure 1 shows the plotted CCDF curves for the three groups, and the curve for the raw DIMES data.
The figure clearly shows the familiar power-law curves in all cases, and we can see that the curves are almost
parallel, indicating a similar value of the power-law exponent $\gamma$.

Table 1 contains the computed values of $\gamma$ for each group. We can immediately see that the values of $\gamma$
on the sampled trees (2.072-2.101) are very close to the true power-law exponent (2.126), thus validating
our analysis that the bias is minor. Furthermore, Table 1 shows that the $\gamma$ values for the three groups are all
close to one another, with a minor decrease in value as the degree of the root grows. Thus, it seems that the
value of the power-law exponent in the sampled tree is largely invariant to the degree of the tree root.
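
The text does not state how the exponents in Table 1 were computed. One common and fairly crude approach, shown here purely as an assumption and not as the authors' procedure, is to build the empirical CCDF and fit a straight line to it on log-log axes: if $\Pr[\deg \ge k] \propto k^{1-\gamma}$, the fitted slope $s$ gives $\gamma = 1 - s$. The function names `ccdf` and `fit_exponent` are hypothetical.

```python
import math
from collections import Counter

def ccdf(degrees):
    """Return sorted (k, Pr[degree >= k]) pairs for the observed degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    points = []
    remaining = n                      # number of vertices with degree >= k
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points

def fit_exponent(points):
    """Least-squares slope s of log(CCDF) against log(k); returns gamma = 1 - s."""
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(p) for _, p in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return 1.0 - sxy / sxx

print(ccdf([1, 1, 2, 3]))  # toy degree list, not DIMES data
```

Applying `fit_exponent(ccdf(tree_degrees))` to the degree list of each sampled BFS tree and to the underlying graph would give exponents that can be compared in the manner of Table 1.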

6 Conclusions and Future Work 

We have shown that if the underlying graph degree distribution obeys a power-law with an exponent $\gamma > 2$
(as is the case in the AS-graph) then w.h.p. the degree distribution of the high-degree vertices of the
sampled graph also follows a power-law, with the same exponent value. Therefore, the bias observed in
tree-sampling regarding the degree distribution of the sampled graph is not significant under these conditions.
Furthermore, since according to the non-tree-sampled data of [SS05] the AS-graph degree distribution does
obey a power-law with an exponent $\gamma$ between 2 and 3, we conclude that the bias observed in the degree
distribution of the BGP data is negligible. Thus, the commonly held view of the Internet's topology as
having a degree distribution of a power-law form with an exponent $2 < \gamma < 3$ seems to be correct, and
unlikely to be a by-product of the BGP data collection process.

Acknowledgment: we thank Sagy Bar for producing the CCDF graphs from the DIMES data. 